Optimization based method for estimating the results of aggregate queries

ABSTRACT

A method for estimating the result of a query on a database wherein the database has data records arranged in tables and the database has a given workload consisting of queries that have been executed on the database. An expected workload is derived comprising a set of queries that can be executed on the database. A sample of a table is constructed by partitioning the data records into regions based on the queries in the expected workload such that no query in the expected workload selects a proper subset of any region. The estimation error is expressed as a function of a number of records selected per region and a number of records is allocated to each region in a manner that minimizes the estimation error. The query accesses the sample and is executed on the sample, returning an estimated query result. The expected workload can be derived by assigning a probability of occurrence to a query that is based on an amount of similarity between the query and the given workload and compiling a set of queries and their corresponding probability of occurrence.

TECHNICAL FIELD

[0001] The invention relates to the field of database systems. Moreparticularly, the invention relates to a method of estimating the resultof an aggregate query based on an expected database workload.

BACKGROUND OF THE INVENTION

[0002] In recent years, decision support applications such as On LineAnalytical Processing (OLAP) and data mining tools for analyzing largedatabases have become popular. A common characteristic of theseapplications is that they require execution of queries involvingaggregation on large databases, which can often be expensive andresource intensive. Therefore, the ability to obtain approximate answersto such queries accurately and efficiently can greatly benefit theseapplications. One approach used to address this problem is to useprecomputed samples of the data instead of the complete data to answerthe queries. While this approach can give approximate answers veryefficiently, it can be shown that identifying an appropriate precomputedsample that avoids large errors on any arbitrary query is virtuallyimpossible, particularly when queries involve selections, GROUP BY andjoin operations. To minimize the effects of this problem, previousstudies have proposed using the workload to guide the process ofselecting samples. The goal is to pick a sample that is tuned to thegiven workload and thereby insure acceptable error at least for queriesin the workload.

[0003] Previous methods of identifying an appropriate precomputed samplesuffer from three drawbacks. First, the proposed solutions use ad-hocschemes for picking samples from the data, thereby resulting in degradedquality of answers. Second, they do not attempt to formally deal withuncertainty in the expected workload, i.e., when incoming queries aresimilar but not identical to the given workload. Third, previous methodsignore the variance in the data distribution of the aggregatedcolumn(s).

[0004] One type of method for selecting a sample is based on weightedsampling of the database. Each record t in the relation R to be sampledis tagged with a frequency f_(t) corresponding to the number of queriesin the workload that select that record. Once the tagging is done, anexpected number of k records are selected in the sample, where theprobability of selecting a record t (with frequency f_(t)) isk*(f_(t)/Σ_(u)f_(u)) where the denominator is the sum of the frequenciesof all records in R. Thus, records that are accessed more frequentlyhave a greater chance of being included inside the sample. In the caseof a workload that references disjoint partitions of records in R with afew queries that reference large partitions and many queries thatreference small partitions, most of the samples will come from the largepartitions. Therefore there is a high probability that no records willbe selected from the small partitions and the relative error in usingthe sample to answer most of the queries will be large.

[0005] Another sampling technique that attempts to address the problemof internal variance of data in an aggregate column focuses on specialtreatment for “outliers,” records that contribute to high variance inthe aggregate column. Outliers are collected in a separate index, whilethe remaining data is sampled using a weighted sampling technique.Queries are answered by running them against both the outlier index aswell as the weighted sample. A sampling technique called “Congress”tries to simultaneously satisfy a set of GROUP BY queries. Thisapproach, while attempting to reduce error, does not minimize anywell-known error metric.

SUMMARY OF THE INVENTION

[0006] Estimating a result to an aggregate query by executing the queryon a sample that has been constructed to minimize error over an expectedworkload increases the accuracy of the estimate.

[0007] In accordance with the present invention, a method is used forapproximately answering aggregation queries on a database having datarecords arranged in tables. The invention uses as input a givenworkload, i.e, the set of queries that execute against the database. Thedata records in a table are accessed to construct a sample thatminimizes the estimation error based on the expected workload that isderived from the given workload. Subsequently, incoming queries aredirected to access the sample to determine an approximate answer. Thequeries are executed on the sample and an estimated query result isprovided. In a preferred embodiment, an estimated error of the estimatedanswer is provided with the estimated query result.

[0008] In an exemplary embodiment, the sample is constructed bypartitioning the table into regions based on the queries in the expectedworkload and selecting samples from the regions in a manner thatminimizes estimation error over the expected workload. The table ispartitioned into regions by grouping data records such that no query inthe given workload selects a proper subset of any region. Each regionmay be further divided into finer regions such that the records in eachfiner region have similar values. According to a feature of an exemplaryembodiment, the step of selecting samples from the regions is performedby expressing the mean squared estimation error as a function of thenumber of samples allocated to each region and allocating the samples tominimize the estimation error.

[0009] The expected workload can be identical to the given workload, orit can be a statistical model in which a probability of occurrencerelated to an amount of similarity between the query and queries in thegiven workload is assigned to each possible query. Predeterminedconstants related to the amount future queries may vary from queries inthe given workload may be used to construct the expected workload model.

[0010] In an embodiment, samples are selected from regions that have arelatively great importance. Importance can be measured by the number ofqueries that select a given region and/or the number of queries in aregion. Regions of relatively low importance are merged with moreimportant regions in an exemplary embodiment.

[0011] In an embodiment, the number of samples allocated to each regionis rounded down to the nearest integer, the remaining fractional valuesare accumulated, and the accumulated fractional values are redistributedto regions such that the estimation error is impacted the least.

[0012] In one embodiment, one sample is chosen from each of the regionsand information about the region such as the sum of all records in theregion is associated with each sample.

[0013] In one embodiment, the step of executing the query is performedby joining a sample of a table and zero or more tables in the database.In an embodiment, the step of constructing a sample is performed byjoining at least two tables and accessing data records in the resultingjoin to construct the sample. In an embodiment, the table is dividedinto regions based on values of the aggregation attributes and eachregion is further partitioned based on the queries in the workload.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014]FIG. 1 illustrates an operating environment for estimating aresult to an aggregate query on a database by executing the query on asample that has been constructed to minimize error over an expectedworkload;

[0015]FIG. 2 illustrates a database system suitable for practice of anembodiment of the present invention;

[0016]FIG. 3 is a block diagram of a database system depicting asampling tool and query modifying tool in accordance with an embodimentof the present invention;

[0017]FIG. 4 is a flow diagram of a method for constructing a sample inaccordance with an embodiment of the present invention;

[0018]FIG. 5 is a Venn diagram depiction of a relation R being actedupon by an embodiment of the present invention;

[0019]FIG. 6 is a Venn diagram depiction of a relation R being actedupon by an embodiment of the present invention;

[0020]FIG. 7 is a flow diagram of a method for constructing an expectedworkload in accordance with an embodiment of the present invention; and

[0021]FIG. 8 is a Venn diagram depiction of a relation R being actedupon by an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0022] Estimating a result to an aggregate query by executing the queryon a sample that has been constructed to minimize error over an expectedworkload increases the accuracy of the estimate.

[0023] The subject matter of this patent application is disclosed in apaper presented at the ACM SIGMOD 2001 conference, “A Robust,Optimization-Based Approach for Approximate Answering of AggregateQueries” by Chaudhuri, Das, and Narasayya. This paper is hereinincorporated by reference.

[0024] Exemplary Embodiment for Practicing the Invention

[0025]FIG. 2 illustrates an example of a suitable client/server system10 for use with an exemplary embodiment of the invention. The system 10is only one example of a suitable operating environment for practice ofthe invention. The system includes a number of client computing devices12 coupled by means of a network 14 to a server computer 16. The server16 in turn is coupled to a database 18 that is maintained on a possiblylarge number of distributed storage devices for storing data records.The data records are maintained in tables that contain multiple numberof records having multiple attributes or fields. Relations betweentables are maintained by a database management system (DBMS) thatexecutes on the server computer 16. The database management system isresponsible for adding, deleting, and updating records in the databasetables and also is responsible for maintaining the relational integrityof the data. Furthermore, the database management system can executequeries and send snapshots of data resulting from those queries to aclient computer 12 that has need of a subset of data from the database18.

[0026] Data from the database 18 is typically stored in the form of atable. If the data is “tabular”, each row consists of a unique columncalled “case id” (which is the primary key in database terminology) andother columns with various attributes of the data.

[0027] Computer System

[0028] With reference to FIG. 1 an exemplary embodiment of the inventionis practiced using a general purpose computing device 20. Such acomputing device is used to implement both the client 12 and the server16 depicted in FIG. 2. The device 20 includes one or more processingunits 21, a system memory 22, and a system bus 23 that couples varioussystem components including the system memory to the processing unit 21.The system bus 23 may be any of several types of bus structuresincluding a memory bus or memory controller, a peripheral bus, and alocal bus using any of a variety of bus architectures.

[0029] The system memory includes read only memory (ROM) 24 and randomaccess memory (RAM) 25. A basic input/output system 26 (BIOS),containing the basic routines that helps to transfer information betweenelements within the computer 20, such as during start-up, is stored inROM 24.

[0030] The computer 20 further includes a hard disk drive 27 for readingfrom and writing to a hard disk, not shown, a magnetic disk drive 28 forreading from or writing to a removable magnetic disk 29, and an opticaldisk drive 30 for reading from or writing to a removable optical disk 31such as a CD ROM or other optical media. The hard disk drive 27,magnetic disk drive 28, and optical disk drive 30 are connected to thesystem bus 23 by a hard disk drive interface 32, a magnetic disk driveinterface 33, and an optical drive interface 34, respectively. Thedrives and their associated computer-readable media provide nonvolatilestorage of computer readable instructions, data structures, programmodules and other data for the computer 20. Although the exemplaryenvironment described herein employs a hard disk, a removable magneticdisk 29 and a removable optical disk 31, it should be appreciated bythose skilled in the art that other types of computer readable mediawhich can store data that is accessible by a computer, such as magneticcassettes, flash memory cards, digital video disks, Bernoullicartridges, random access memories (RAMs), read only memories (ROM), andthe like, may also be used in the exemplary operating environment.

[0031] A number of program modules may be stored on the hard disk,magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including anoperating system 35, one or more application programs 36, other programmodules 37, and program data 38. A user may enter commands andinformation into the computer 20 through input devices such as akeyboard 40 and pointing device 42. Other input devices (not shown) mayinclude a microphone, joystick, game pad, satellite dish, scanner, orthe like. These and other input devices are often connected to theprocessing unit 21 through a serial port interface 46 that is coupled tothe system bus, but may be connected by other interfaces, such as aparallel port, game port or a universal serial bus (USB). A monitor 47or other type of display device is also connected to the system bus 23via an interface, such as a video adapter 48. In addition to themonitor, personal computers typically include other peripheral outputdevices (not shown), such as speakers and printers.

[0032] The computer 20 may operate in a networked environment usinglogical connections to one or more remote computers, such as a remotecomputer 49. The remote computer 49 may be another personal computer, aserver, a router, a network PC, a peer device or other common networknode, and typically includes many or all of the elements described aboverelative to the computer 20, although only a memory storage device 50has been illustrated in FIG. 1. The logical connections depicted in FIG.1 include a local area network (LAN) 51 and a wide area network (WAN)52. Such networking environments are commonplace in offices,enterprise-wide computer networks, intranets and the Internet.

[0033] When used in a LAN networking environment, the computer 20 isconnected to the local network 51 through a network interface or adapter53. When used in a WAN networking environment, the computer 20 typicallyincludes a modem 54 or other means for establishing communications overthe wide area network 52, such as the Internet. The modem 54, which maybe internal or external, is connected to the system bus 23 via theserial port interface 46. In a networked environment, program modulesdepicted relative to the computer 20, or portions thereof, may be storedin the remote memory storage device. It will be appreciated that thenetwork connections shown are exemplary and other means of establishinga communications link between the computers may be used.

[0034] Constructing a Sample for Estimating Aggregate Queries UsingStratified Sampling

[0035] Referring now to FIG. 3, a sampling tool 65 that accessesdatabase tables 61 and constructs samples 62 is shown. The samples 62are optimized for aggregate queries such as COUNT, SUM, and AVERAGE andare tuned to an expected workload. A workload is specified as a set ofqueries paired with their corresponding weights. A weight indicates theimportance of a query in the workload. The sampling tool 65 utilizes agiven database workload 63 and lifting parameters 64, which will bediscussed in detail later, to construct the expected workload andcorresponding samples 62. Each record in a sample is allowed to containa few additional columns (such as a scaling factor) with each record. Aquery rewriting tool 67 rewrites an incoming query 66 to execute on thesamples, if appropriate, and then executes the queries on the samples toprovide an answer set. An error estimate 68 of the estimated answer mayalso be provided along with the answer. To arrive at the answer set, thevalue(s) of the aggregate column(s) of each record in the sample arefirst scaled up by multiplying with the scaling factor and thenaggregated.

[0036]FIG. 4 illustrates a sampling method 100 for constructing samples62 in accordance with a preferred embodiment of the present invention.In general, the sampling method 100 is an adapted stratified samplingmethod. Consider relation R containing two columns <ProductId, Revenue>and four records {<1,10>, <2,10>, <3,10>, <4,1000>}. The following twoselection queries are executed on relation R: Q₁=SELECT COUNT(*)FROM RWHERE ProductId IN (3,4) and Q2=SELECT SUM(y) FROM WHERE ProductId IN(1, 2, 3). The population of a query Q (denoted by POP_(Q)) on arelation R is a set of size |R| that contains the returned value of theaggregated column of each record in R that is selected by Q or 0 if therecord is not selected. Therefore POP_(Q1)={0, 0, 1, 1} andPOP_(Q2)={10, 10, 10, 0}. For COUNT and SUM aggregates, the query can beanswered by summing up its population. Each query defines its ownpopulation of the same relation R, and therefore the challenge is topre-compute a sample that will work well for all populations (i.e.queries).

[0037] Classical sampling techniques, such as uniform sampling, do notdeal well with the problem of building a sample that works with multiplepopulations. Stratified sampling, on the other hand, is a generalizationof uniform sampling where a population is partitioned into multiplesubregions and samples are selected uniformly from each subregion, with“important” subregions contributing relatively more samples. In general,stratified sampling is effective when partial knowledge of thepopulation is leveraged to design subregions whose internal variance issmall. The method 100 uses the queries in the workload to determine howbest to stratify the relation. The method 100 finds the optimalstratification of a relation for a workload and divides availablerecords in the sample among the subregions.

[0038] For purposes of this description, a scheme is a partitioning of arelation R into r subregions containing n₁, . . . , n_(r) records (whereΣn_(j)=n), with k₁, k₂, . . . , k_(r) records uniformly sampled fromeach subregion (where Σk_(j)=k). In addition to the sampled recordsthemselves, to produce an approximate answer to the query the schemealso associates a scale factor with each record. Queries are answered byexecuting them on the sample instead of R. For a COUNT query, the scalefactor entries of the selected records are summed, while for a SUM(y)query the quantity y multiplied by the scale factor for each selectedrecord is summed. The method 100 will now be discussed in detail inconjunction with the flowchart of FIG. 4.

[0039] In step 110, a number of samples k is determined representing themaximum number of samples that can be stored due to memory constraints.The database table being sampled is partitioned into fundamental regionsbased on queries executed on the database (step 130). FIG. 5 illustratesa relation R partitioned into ten fundamental regions R₁-R₁₀ in responseto a workload consisting of queries Q₁-Q₄. A fundamental region isdefined as a maximal subset of records of R that are selected by thesame set of queries in the workload. Thus, for a given fundamentalregion there are no queries in the workload that select a proper subsetof records from that fundamental region. To identify fundamental regionsin relation R for a workload W consisting of selection queries, atechnique known as tagging is used. Tagging logically associates witheach record tεR an additional tag column that contains the list ofqueries in W that reference t. This column is separated out to form adifferent relation R′ because users do not want to change the schema oftheir database if avoidable. Also, it is significantly faster to updatethe tag column in a separate relation R′. Records in R′ have aone-to-one correspondence with records in R. This is done by includingthe key column(s) of R in R′. When a query Q Σ W is executed, for eachrecord in R required to answer Q the query id of Q is appended to thetag column of the corresponding record in R′. When R′ is sorted by thetag column, records belonging to the same fundamental region appeartogether.

[0040] To build the expression for the mean square error, MSE(p_({w})),for each query Q in W the algorithm has to visit each fundamentalregion. If there are q queries in W and R fundamental regions, theproduct q*R can become quite large. This scalability issue is handled byeliminating regions of low importance immediately after they have beenidentified. In step 130, the method removes regions with small f_(j)*n²_(j) values, where f_(j) represents the weighted number of queries thataccess this region. The term f_(j) measures the number of queries thatare affected by R_(j), while the expected error by not including theregion is proportional to n² _(j). For SUM queries, the importance ofeach region is f_(i)*Y_(i) where Y_(i) is the sum of the values of theaggregate column within the region. Depending on the nature of the queryinvolved, each fundamental region may be further partitioned intosubregions (step 140, discussed later in detail). For a workloadconsisting of count queries, it is not necessary to partition thefundamental regions into subregions, and the method proceeds to step145, in which an expected workload is derived.

[0041] Fixed Workload

[0042] For the case of a constant or fixed workload, the expectedworkload is derived in step 145 to be equivalent to the “given” workload(the workload the database has experienced during a past interval ofoperation). The expected workload may also be an expanded or “lifted”version of the actual workload as will be discussed in detail later inconjunction with FIGS. 7 and 8.

[0043] In step 150 the method 100 expresses the error incurred in usingthe sample in terms of the number of samples k_(j) assigned to eachregion. The method assumes that k₁. . . k_(R) are unknown variables suchthat Σk_(j)=K. In an exemplary embodiment, the error that results fromestimating a result rather than scanning the database to compute aresult is expressed as the MSE of the workload, a sum of the meansquared error for each query Q in the workload. Other types of errorscould be used in the practice of the present invention such as the rootmean squared error, the mean error over all queries in the workload, orthe maximum error over the workload, but for purposes of thisdiscussion, the MSE will be used.

[0044] For a workload consisting of count queries, the MSE(p_(w)) can beexpressed as a weighted sum of the MSE of each query in the workload.Details of this expression will be discussed later. In step 160, the ksamples are allocated in a manner that minimizes this error.Minimization is accomplished by partially differentiating with eachvariable and setting each result to zero. This gives rise to 2*ksimultaneous linear equations, which can be solved using an iterativetechnique based on the Gauss-Seidel method. For the particular case of afixed workload, exactly one sample is taken from each region. In step170 the calculated number of samples is selected from each region toconstruct the sample.

[0045] The sample that results from a fixed workload that assumes afuture workload equivalent to the given workload accurately estimatesqueries within the given workload, but may poorly estimate results ofqueries that deviate from the given workload. Building a sample based onan expected, rather than the given, workload produces more accurateestimations over a range of queries, some of which are outside the pastworkload. A technique that builds such an expected workload by “lifting”the given workload based on predetermined parameters follows.

[0046] Building an Expected Workload

[0047] The stratified sampling method 100 outlined in FIG. 4 can be mademore resilient to the situation when the expected workload derived instep 145 consists of queries that are similar but not identical to thepast workload. FIGS. 7 and 8 pertain to a method of constructing anexpected workload from a given workload and lifting parameters thatspecify the degree of similarity of expected queries to past queries.For the purposes of this description, two queries Q and Q′ are similarif the answer set for Q and Q′ have significant overlap. Let R_(Q) andR_(Q)′, respectively denote the set of records selected by Q and Q′ fromR. As the overlap between R_(Q) and R_(Q)′ increases, so does thesimilarity between Q and Q′.

[0048] Referring now to FIG. 7, a method of constructing an expectedworkload 700 is shown in flowchart format. This description will focuson the case of single table selection queries with aggregationcontaining either the SUM or COUNT aggregate. In the notation of thisdescription, p_({Q}) maps subsets of R to probabilities. For all R′⊂R,p_({Q})(R′) denotes the probability of occurrence of any query thatselects R′. In step 720 two parameters δ(½≦δ≦1) and γ(0≦γ≦½) thatcontrol the intended degree of similarity and dissimilarity respectivelyare selected. For example, as δ approaches 1 and γ approaches 0,incoming queries are predicted to be identical to queries in the givenworkload. As δ approaches 0.5 and γ approaches 0, incoming queries arepredicted to be subsets of the given workload's queries. These valuescan be input by a database manager or they may be calculated by themethod.

[0049] The method can calculate values for s δ and γ automatically. Inone embodiment, the workload W is divided into two equal halves calledthe training and test set. The two-dimensional space 0.5<δ<1, 0<γ<0.5 isdivided into a grid in which each dimension is divided into a fixednumber of intervals. For each point (δ, γ) in the grid a sample iscomputed for the training set and the error for the test set isestimated. The grid point with the lowest error is selected and used asthe setting for δ and γ.

[0050] In step 740 (FIG. 7), the method derives a model p_(w) for theexpected workload that assigns a probability of occurrence to eachpossible query based on the selected δ and γ. FIG. 8 shows a Venndiagram of R, RQ and R′, where n₁, n₂, n₃, and n₄ are the counts ofrecords in the regions indicated. The functional form of p{Q} withreference to FIG. 8 is:

P _({Q})(R′)=δ^(n) ^(₂) (1−δ)^(n) ^(₁) γ^(n) ^(₃) (1−γ)^(n) ^(₄)

[0051] Note that when n₂ or n₄ are large (i.e. the overlap is large),the probability of R′ is high, whereas if n₁ or n₃ are large (i.e. theoverlap is small), the probability is small. When δ approaches ½, theprobability of R′ approaches (½)^(n) (i.e. all subsets R′ are equallylikely to be selected), whereas when γ approaches 1, this probabilityrapidly drops to 0 (i.e. only subsets R′ that are very similar to R_(Q)are likely). Based on p_({Q}) has been p_(w) is derived (step 740 inFIG. 7) using the following equation:${p_{W}\left( R^{\prime} \right)} = {\sum\limits_{i}^{q}\quad {w_{i}{p_{\{ Q_{i}\}}\left( R^{\prime} \right)}}}$

[0052] While the method 700 of modeling an expected workload can be usedin any case where it is desirable to quantify a future workload, it willbe discussed in further detail with respect to the stratified samplingmethod depicted in FIG. 4.

[0053] Constructing a Sample Using Stratified Sampling and an ExpectedWorkload COUNT Queries

[0054] Given a probability distribution of queries P_(w), the MSE forthe distribution is defined as Σ_(Q) p(Q)*SE(Q), where p(Q) is theprobability of query Q and SE(Q) is the squared error of query Q.Referring back to FIG. 4, the stratified sampling method 100 can bepracticed using an expected workload determined by the modeling method700 depicted in FIG. 7. Steps 110-145 are not affected by theincorporation of an expected workload into the stratified samplingmethod 100 and for the case of a COUNT query step 140 is not necessary.In step 150, the error is expressed in terms of k_(j), the number ofterms assigned to each fundamental region or subregion. For a COUNTquery, the MSE is as follows:${{MSE}\left( p_{\{ Q\}} \right)} \approx \frac{{\sum\limits_{R_{j} \subseteq R_{Q}}\quad {\frac{n_{j}^{2}}{k_{j}}{\delta \left( {1 - \delta} \right)}}} + {\sum\limits_{{Rj} \subseteq {R\backslash R_{Q}}}\quad {\frac{n_{j}^{2}}{k_{j}}{\gamma \left( {1 - \gamma} \right)}}}}{\left( {{\sum\limits_{R_{j} \subseteq R_{Q}}{\delta \quad n_{j}}} + {\sum\limits_{R_{j} \subseteq {R\backslash R_{Q}}}{\gamma \quad n_{j}}}} \right)^{2}}$

[0055] As n, the number of records, gets larger, the approximation getsmore accurate. As δ approaches 1 and γ approaches 0, MSE(p_({Q}))approaches 0. This is because such a setting for δ and γ indicates thatthe queries expected in the workload are extremely similar to Q, i.e,likely to contain almost all records in the answer to Q, and almost norecord that does not belong to the answer to Q. Given thatMSE(p_(w))=Σ_(j)(α_(j)/k_(j)), where each α_(j) is a function of n₁, . .. , n_(r), δ, and γ, with α_(j) depending on n_(j), the frequency withwhich a fundamental region is accessed by queries in the workload isimplicitly accounted for by MSE(p_(w)). Σ_(j)(α_(j)/k_(j)) is minimizedsubject to Σ_(j)k_(j)=k if k_(j)=k*(sqrt(α_(j))/Σ_(i)sqrt(α_(j))). Thisprovides a closed-form and computationally inexpensive solution to theallocating step 160.

[0056] The values for the kj's determined in step 160 may be fractional.It is necessary that these values be integers so that a correspondingnumber of samples can be selected from each region. In an exemplaryembodiment, each k_(j) is rounded to └k_(j)┘. The leftover fractions areaccumulated, and redistributed in a manner that increases the MSE theleast.

[0057] If many kj's are small (<1), then after the rounding is performedthe allocation step 160 may assign many regions with no samples.Moreover, fundamental regions that have been pruned out for scalabilityreasons as discussed above will also not receive any samples. Due toboth these reasons, a non-negligible bias may be introduced into theestimates, i.e. the expected value of the answer may no longer be equalto the true answer. In an exemplary embodiment, the fundamental regionswith no allocated samples are merged with the other fundamental regionsinto super-regions such that the MSE is affected as little as possible.The merging of two fundamental regions accounts for the internalvariance in the values of the fundamental regions, the frequency withwhich a fundamental region is included by queries in the workload, andthe mean value of the aggregate column in a region. Since all of thefundamental regions in the relation are part of some super-regions, andeach super-region has one or more records assigned to it, the bias isreduced.

[0058] SUM Queries

[0059] Still referring to FIG. 4, the stratified sampling method 100 canbe practiced to estimate answers to SUM queries. The method must bemodified because for SUM queries, the variance of the data in theaggregated query must be taken into account. In step 140, thefundamental regions are divided into a set of h subregions having asignificantly lower internal variance that the region as a whole. Toperform step 140, a histogram for each fundamental region, whichapproximates the density distribution and stratification into hsubregions is accomplished in a single scan of R. A value of 5 for h isappropriate for practice of the method 100.

[0060] Once the subregions have been identified in step 140 and theexpected workload has been derived in step 145, the MSE(p_(Q)) in termsof the unknowns k₁, . . . , k_(h-r) is derived in step 150. For SUMqueries, the specific values of the aggregate column influenceMSE(p_({Q})). For an expected workload lifted from the existing workloadusing parameters δ and γ, the expected number of records picked by aquery from among the answer set of Q is d*n_(j). The expression forMSE(p_({Q})) takes into account the expected variance among the valuesin subsets of d*n_(j) records from each fundamental region R_(j) thatare within Q. This expected variance is denoted by S² _(δ,j). Likewise,S² _(g,j) denotes the corresponding expected variance for eachfundamental region that is outside Q. The formula for MSE(p_({Q})) for aSUM query Q in W is:${{MSE}\left( p_{\{ Q\}} \right)} \approx \frac{{\sum\limits_{R_{j} \subseteq R_{Q}}\quad {\frac{n_{j}^{2}}{k_{j}}\left( S_{\delta,j}^{2} \right)}} + {\sum\limits_{R_{j} \subseteq {R\backslash R_{Q}}}\quad {\frac{n_{j}^{2}}{k_{j}}\left( S_{\gamma,j}^{2} \right)}}}{\left( {{\sum\limits_{R_{j} \subseteq R_{Q}}{\delta \quad Y_{j}}} + {\sum\limits_{R_{j} \subseteq {R\backslash R_{Q}}}{\gamma \quad Y_{j}}}} \right)^{2}}$

[0061] Y_(j) is the sum of the aggregate column of all records in regionRj. The above approximation is true where the values in the aggregationcolumn are all strictly positive or negative. The formula does not holduniversally irrespective of the domain values in the aggregate columns.This is because there could be a query that selects a subset of R whoseSUM aggregate is zero (or extremely close to zero) but whose is variancelarge. Even though such a query may have a small probability ofoccurrence in the lifted distribution, if not answered exactly, itsrelative error can become infinite. Finally, as in the case of COUNT,the approximation converges to be an exact equality when n is relativelylarge.

[0062] Because the denominator in the expression for MSE(p_({Q})) abovecontains the square of the expected sum it is more difficult to computevalues for the k_(j)'s (step 160) than in the case of MSE(p_({Q})) forCOUNT queries already discussed. However, the computation can be done ina single scan of R by keeping track of the sum of values, sum squares ofvalues in each region, and n_(j) (number of records) in the region. Infact, this computation can be accomplished with the same scan of Rrequired for the stratification step 140. Once the k_(s) samples areselected from each subregions in step 170, the sample can be used toestimate SUM queries.

[0063] GROUP BY Queries

[0064]FIG. 6 illustrates a relation R upon which a GROUP BY is executed.Given a GROUP BUY query Q with weight w in the workload, Q partitions Rinto g groups: G₁, . . . , G_(g). Within each group G_(j), let S_(j) bethe set of records selected. To handle this case, the method 100 (FIG.4) replaces Q in the workload with g separate selection queries having aweight of w/g that select S₁, . . . , S_(g) respectively, and uses theworkload lifting method 700 to lift each query separately. In step 130the method 100 treats each GROUP BY query Q as a collection of gselection queries with aggregation, and tags the records with the groupthat they belong to. During the tagging process, for GROUP BY columns ofinteger data types a double <c,v> is appended in addition to the queryid, where c is the column id of the GROUP BY column and v is the valueof that column in record t. For non-integer data types the value of theGROUP BY column is treated as a string and a string hashing function isused to generate an integer value. As already described, when R′ issorted on the tag column, all records belonging to the same fundamentalregion appear together. The rest of the stratified sampling 100 methodcontinues as described in the text accompanying FIG. 4.

[0065] JOIN Queries

[0066] The method 100 can be extended to a broad class of queriesinvolving foreign key joins over multiple relations. A relation is afact relation in the schema if it references (i.e. contains the foreignkeys of) one or more reference relations but is not referenced by anyother relation. A relation is a dimension relation if it does notcontain foreign keys of any other relation. Thus a relation that isneither a fact relation nor a dimension relation must be referenced byone or more relations and must contain foreign keys of one or morerelations. A star query is a query that a) is over the join of exactlyone source relation and a set of dimension relations each of which isreferenced by the source relation; b) GROUP BY and aggregation over acolumn of the source relation; and c) may have selections on source andor dimension relations. Star queries are widely used in the context ofthe decision support queries. The method 100 handles star queries byobtaining a sample over the source relation according to the method 100.When a query is posed, the sample over the source relation is used tojoin the dimension relation s in their entirety with the sample tocompute the aggregate with the appropriate use of a scale factor. Thismethod is reasonable because typically the source relation is a largefact relation, while the other relations are smaller dimensionrelations.

[0067] A record t in the source relation is deemed useful for a JOINquery Q in the workload only if t contributes to at least one answerrecord of Q, i.e., t must successfully join with other dimensionrelations and satisfy all the selection conditions in the query as well.In step 130, the method 100 tags only the records from the fact relationthat join with the dimension relation and satisfy the selectionconditions in Q. The tagging step itself is no different from thetechnique used for single relation queries already described.Alternatively join synopses can be computed which results in reducedrun-time cost at the expense of increased storage requirements due toadditional columns from the join. Allocation of k in steps 150 and 160is done by setting up MSE(p_(w)) and minimizing it.

[0068] Heterogeneous Mix of Queries

[0069] To handle a workload that contains a mix of COUNT and SUM(y)queries, each term MSE(p_({Q})) is set up to reflect the type of query qin the workload since, as explained above the analysis for COUNT and SUMdiffer. Once these expressions are set up in step 150, minimizing theresulting MSE)(p_(w)) in step 160 can be accomplished. For a mix ofSUM(s) and SUM(y), where x and y are two columns from the same relation,each fundamental region is stratified into h subregions separately for xand y in step 140. These subregions are superimposed on one another.This approach can result in an explosion in the number of subregions. Analternative approach that is more scalable is to minimize the varianceof the more important aggregate column (SUM(x) for this description) andignore the data variance of the others. This is equivalent toeffectively replacing any occurrence of SUM(y), where x≠y as if it werethe COUNT(*) aggregate. For aggregates of form SUM(<expression>) (e.g.SUM(x*y+z*) the expression is treated as a new derived column c andoptimized for SUM(c). The technique can be expanded to handle cases whenthe workload consists of aggregation queries with nested sub-queries, asalso single-table selection queries with aggregation but where eachquery can potentially reference a different relation.

[0070] While the exemplary embodiments of the invention have beendescribed with a degree of particularity, it is the intent that theinvention include all modifications and alterations from the discloseddesign falling within the spirit or scope of the appended claims.

We claim:
 1. A method for estimating the result of an aggregation queryon a database wherein the database has data records arranged in tables,and wherein the database has a given workload comprising a set ofqueries that has been executed on the database, the method comprisingthe steps of: a) deriving an expected workload based on the givenworkload; b) constructing a sample by accessing the data records in atable to minimize an estimation error based on the expected workload; c)directing the query to access the sample; d) executing the query on thesample; and e) returning an estimated query result.
 2. The method ofclaim 1 further comprising the step of returning an estimated error ofthe estimated query result.
 3. The method of claim 1 wherein the step ofconstructing a sample is performed by partitioning rows of the tableinto regions based on the queries in the expected workload and selectingrecords from the regions in a manner that minimizes estimation errorover the expected workload.
 4. The method of claim 3 wherein the step ofselecting records from the regions is performed by expressing theestimation error as a function of the number of records allocated toeach region and allocating the number of allocated records to minimizethe estimation error.
 5. The method of claim 4 wherein the estimationerror is the mean squared error.
 6. The method of claim 3 wherein thestep of partitioning rows of the table into regions is performed bygrouping data records such that no query in the given workload selects aproper subset of the rows of any region.
 7. The method of claim 1wherein the expected workload is derived by assigning a probability ofoccurrence to each possible query that can be executed on the table. 8.The method of claim 7 wherein the probability of occurrence assigned toeach query is related to an amount of similarity between the query andqueries in the given workload.
 9. The method of claim 1 wherein theexpected workload is derived using predetermined constants that arerelated to the amount regions selected by future queries are predictedto overlap regions selected by queries in the given workload.
 10. Themethod of claim 3 wherein records are selected from regions that have arelatively high importance.
 11. The method of claim 10 wherein theimportance of a region is related to a number of queries that select theregion.
 12. The method of claim 10 wherein the importance of a region isrelated to a number of records contained in the region.
 13. The methodof claim 3 wherein regions are merged with other regions based on amerge criteria.
 14. The method of claim 4 comprising the steps ofrounding the number of records allocated to each region down to thenearest integer, accumulating remaining fractional values, andredistributing the accumulated fractional values to regions such thatthe estimation error is impacted the least.
 15. The method of claim 1wherein information about the region to which a record belongs isassociated with each record in the sample.
 16. The method of claim 15wherein the information comprises a scaling factor.
 17. The method ofclaim 3 comprising the step of subdividing each region into subregionssuch that the records in each subregion have similar values and whereinrecords are selected from subregions in a manner that minimizesestimation error.
 18. The method of claim 1 wherein the step ofexecuting the query on the sample comprises performing a join operationof the sample of a table with other tables in the database.
 19. Themethod of claim 1 wherein the step of constructing a sample comprisesperforming a join of at least two tables in the database and accessingdata records in the resulting join to construct the sample.
 20. Themethod of claim 3 comprising the step of partitioning each region intoregions based on the queries in the workload and further dividing thetable into regions based on values of an attribute.
 21. The method ofclaim 1 wherein the expected workload is identical to the givenworkload.
 22. The method of claim 3 wherein one record is chosen fromeach of the regions.
 23. The method of claim 15 wherein the additionalinformation is the sum of all records in the region.
 24. The method ofclaim 15 wherein the additional information is the count of all therecords in the region.
 25. A method for estimating the result of a queryon a database wherein the database has data records arranged in tables,and wherein the database has a given workload comprising a set ofqueries that has been executed on the database, the method comprisingthe steps of: a) deriving an expected workload based on the givenworkload; b) constructing a sample of a table by partitioning the datarecords into regions based on the queries in the expected workload suchthat no query in the expected workload selects a proper subset of anyregion, expressing the estimation error as a function of a number ofrecords selected per region, and allocating a number of records to eachregion that minimizes the estimation error; c) directing the query toaccess the sample; d) executing the query on the sample; and e)returning an estimated query result.
 26. The method of claim 25 whereinthe expected workload is a statistical model that is derived byassigning a probability of occurrence to each possible query.
 27. Themethod of claim 26 wherein the probability of occurrence assigned toeach query is related to an amount similarity between the query andqueries in the given workload.
 28. A method for constructing a sample ofrecords in a database table contained in a database having a givenworkload comprising a set of queries that has been accessed on thedatabase, the method comprising the steps of: a) partitioning thedatabase into regions based on the workload; b) assigning a number ofrecords to be selected to at least a subset of the regions such that theestimation error is minimized; and c) selecting the assigned number ofrecords from each of the regions to form the sample.
 29. The method ofclaim 28 comprising the step of deriving an expected workload based onthe given workload.
 30. The method of claim 28 wherein the partitioningstep is performed by grouping records selected by queries in the givenworkload into regions such that no query selects a proper subset of anyof the regions.
 31. The method of claim 28 wherein the assigning step isperformed by expressing the estimation error in terms of the number ofrecords assigned to each region and assigning a number of records toeach region in a manner that minimizes the estimation error.
 32. Themethod of claim 28 wherein the estimation error is the mean squarederror.
 33. The method of claim 28 wherein one record is selected fromeach region.
 34. The method of claim 28 wherein information about theregion is associated with each record selected from the region.
 35. Themethod of claim 34 wherein the information a scaling factor.
 36. Themethod of claim 28 wherein records are selected from regions have arelatively high importance.
 37. The method of claim 36 wherein theimportance of a region is related to a number of queries that select theregion.
 38. The method of claim 36 wherein the importance of a region isrelated to a number of records in the region.
 39. The method of claim 36wherein regions that have a relatively importance are merged with theregions that have a relatively high importance.
 40. The method of claim28 comprising the steps of rounding the number of records assigned toeach region down to the nearest integer, accumulating remainingfractional values, and redistributing the accumulated fractional valuesto regions such that the estimation error is impacted the least.
 41. Themethod of claim 28 comprising the step of subdividing each region intosubregions such that the records in each subregion have similar valuesand wherein records are selected from subregion in a manner thatminimizes estimation error.
 42. The method of claim 28 comprisingperforming a join of at least two tables and partitioning data recordsin the join.
 43. The method of claim 28 comprising the step ofpartitioning each region into regions based on the queries in theworkload and further dividing the table into regions based on values ofan attribute.
 44. The method of claim 28 wherein the expected workloadis a statistical model that assigns a probability of occurrence to eachpossible query.
 45. A method for constructing a sample of records in adatabase table contained in a database having a fixed workloadcomprising a set of queries that has been executed on the database, themethod comprising the steps of: a) partitioning the database intoregions based on the workload; b) selecting a single record from eachregion; and c) storing the record and additional information about theregion to construct the sample.
 46. The method of claim 45 wherein theadditional information comprises a sum of data values in the region. 47.The method of claim 45 wherein the additional information comprises acount of records in the region.
 48. A method for constructing a model ofan expected database workload based on a given workload comprising a setof queries that have been executed on the database, the methodcomprising the steps of: a) examining the given workload; b) assigning aprobability of occurrence to a query that is based on an amount ofsimilarity between the query and the given workload; and c) deriving anexpected workload model that assigns a probability of occurrence to eachpossible query.
 49. The method of claim 48 wherein the probability ofoccurrence assigned to a query is based on an amount of overlap betweena database region selected by the query and regions selected by queriesin the given workload.
 50. The method of claim 48 wherein theprobability of occurrence assigned to each query is related to an amountof similarity between the query and queries in the given workload. 51.The method of claim 48 wherein the expected workload is derived usingpredetermined constants that are related to the amount regions selectedby future queries are predicted to overlap regions selected by queriesin the given workload.
 52. The method of claim 51 wherein thepredetermined constants are selected by dividing the given workload intoa training and test portion and choosing values of the constants suchthat use of the constants in conjunction with the training portion bestpredicts the test portion.
 53. The method of claim 52 wherein the valuesof the constants are chosen by assigning combinations of possibleconstant values as points in a grid space, applying each point to thetraining portion, and selecting the point that best predicts the testportion.
 54. A computer readable medium containing computer executableinstructions for estimating the result of an aggregation query on adatabase wherein the database has data records arranged in tables, andwherein the database has a given workload comprising a set of queriesthat has been executed on the database, the instructions comprising thesteps of: a) deriving an expected workload based on the given workload;b) constructing a sample by accessing the data records in a table tominimize an estimation error based on the expected workload; c)directing the query to access the sample; d) executing the query on thesample; and e) returning an estimated query result.
 55. The computerreadable medium of claim 54 wherein the step of constructing a sample isperformed by partitioning rows of the table into regions based on thequeries such that no query in the expected workload selects a propersubset of any region in the expected workload and selecting records fromthe regions in a manner that minimizes estimation error over theexpected workload.
 56. The computer readable medium of claim 55 whereinthe step of selecting records from the regions is performed byexpressing the estimation error as a function of the number of recordsallocated to each region and allocating the number of allocated recordsto minimize the estimation error.
 57. The computer readable medium ofclaim 54 wherein the expected workload is a statistical model thatassigns a probability of occurrence to each possible query that can beexecuted on the table.
 58. The computer readable medium of claim 54wherein the expected workload is derived using predetermined constantsthat are related to the amount regions selected by future queries arepredicted to overlap regions selected by queries in the given workload.59. The computer readable medium of claim 54 wherein a scaling factor isassociated with each record in the sample.
 60. The computer readablemedium of claim 55 comprising the step of subdividing each region intosubregions such that the records in each subregion have similar valuesand wherein records are selected from subregions in a manner thatminimizes estimation error.
 61. The computer readable medium of claim 54wherein the expected workload is identical to the given workload. 62.The computer readable medium of claim 55 wherein one record is chosenfrom each of the regions.
 63. A computer readable medium containingcomputer executable instructions for constructing a model of an expecteddatabase workload based on a given workload comprising a set of queriesthat have been executed on the database, the instructions comprising thesteps of: a) examining the given workload; b) assigning a probability ofoccurrence to a query that is based on an amount of similarity betweenthe query and the given workload; and c) deriving an expected workloadmodel that assigns a probability of occurrence to each possible query.64. The computer readable medium of claim 63 wherein the probability ofoccurrence assigned to a query is based on an amount of overlap betweena database region selected by the query and regions selected by queriesin the given workload.
 65. The computer readable medium of claim 63wherein the expected workload is derived using predetermined constantsthat are related to the amount regions selected by future queries arepredicted to overlap regions selected by queries in the given workload.66. The computer readable medium of claim 65 wherein the values of theconstants are chosen by assigning combinations of possible constantvalues as points in a grid space, applying each point to the trainingportion, and selecting the point that best predicts the test portion.67. An apparatus for estimating the result of an aggregation query on adatabase wherein the database has data records arranged in tables, andwherein the database has a given workload comprising a set of queriesthat has been executed on the database, the apparatus comprising: a)deriving an expected workload based on the given workload; b) a sampleconstructor for accessing the data records in a table to minimize anestimation error based on the expected workload to construct a sample;c) a query redirector for directing the query to access the sample; andd) a query estimator for returning an estimated query result e)
 68. Theapparatus of claim 67 comprising an error estimator for determining anestimated error for the estimated query result.
 69. The apparatus ofclaim 67 wherein the constructor partitions rows of the table intoregions based on the queries in the expected workload and selectsrecords from the regions in a manner that minimizes estimation errorover the expected workload.
 70. The apparatus of claim 69 wherein theconstructor selects records from the regions by expressing theestimation error as a function of the number of records allocated toeach region and allocating the number of allocated records to minimizethe estimation error.
 71. The apparatus of claim 69 wherein theconstructor partitions rows of the table into regions by grouping datarecords such that no query in the given workload selects a proper subsetof any region.
 72. The apparatus of claim 69 wherein the constructorassociates a scaling factor with each record in the sample.
 73. Theapparatus of claim 69 wherein the constructor subdivides each regioninto subregions such that the records in each subregion have similarvalues and wherein the constructor selects records from the subregionsin a manner that minimizes estimation error.
 74. The apparatus of claim67 wherein the estimator performs a join operation of the sample of atable with other tables in the database.
 75. The apparatus of claim 67wherein the constructor joins at least two tables in the database andaccesses data records in the resulting join to construct the sample. 76.An apparatus for constructing a model of an expected database workloadbased on a given workload comprising a set of queries that have beenexecuted on the database, the apparatus comprising: a) a monitoringmeans for logging the given workload; b) a modeling means for assigninga probability of occurrence to a query that is based on an amount ofsimilarity between the query and the given workload; and c) aconstructing means for constructing an expected workload model thatassigns a probability of occurrence to each possible query.
 77. Theapparatus of claim 76 wherein the modeling means assign assigns theprobability to a query is based on an amount of overlap between adatabase region selected by the query and regions selected by queries inthe given workload.
 78. The apparatus of claim 76 wherein theconstructing means constructs the expected workload using predeterminedconstants that are related to the amount regions selected by futurequeries are predicted to overlap regions selected by queries in thegiven workload.